You may have heard the phrase, "Your [Machine Learning] model is only as good as its training data." If not, no worries. In this tutorial, we'll explore how to get from raw data to a production-ready Machine Learning application using the Machine Learning Lifecycle.
Machine Learning Lifecycle (MDLC)
Summary
Similar to the Software Development Lifecycle (SDLC), the Model Development Lifecycle (MDLC), or "Machine Learning Lifecycle," is an iterative process used by Data Scientists when developing and productionizing new Machine Learning models.
The MDLC consists of the following steps:
Business Understanding
Before any action takes place, you need to understand the problem you're trying to solve, whether for the company you work for or for an outcome you're trying to produce on your own.
This step consists of:
Gathering Requirements (e.g., tools, budget, team members)
Setting Goals and Milestones (e.g., when to check in with stakeholders, which tasks to accomplish in which time frames)
Conducting an ROI analysis (businesses use this to determine whether the benefits of the model outweigh the costs of producing it).
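As a quick, hypothetical illustration of the arithmetic behind an ROI analysis (all dollar figures below are made up):

    # Hypothetical figures for illustration only.
    annual_benefit = 250_000  # e.g., revenue gained or costs saved by the model
    annual_cost = 100_000     # e.g., infrastructure, salaries, licensing

    # ROI = (benefit - cost) / cost
    roi = (annual_benefit - annual_cost) / annual_cost
    print(f"ROI: {roi:.0%}")  # prints "ROI: 150%"

If the ROI comes out negative or marginal, that's a signal to revisit whether an ML model is worth building at all.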
Model Selection
Once you have a clear understanding of the problem that needs to be solved or the outcome that needs to be produced, the following needs to take place:
Deciding if an ML Model is necessary. As mentioned in the video, if a rule-based system or a similar approach can solve the problem, it should be used, as it will almost always keep costs lower and reduce overhead (such as the number of employees overseeing the model's performance).
Classification vs. Regression. If you decide that ML is the way to go, you need to decide whether to predict a continuous value based on a specified set of Features (discussed in a later tutorial) or to classify items into a set number of classes based on those Features (a minimal sketch of both follows this list).
Can you use a ready-made solution? As the business world has evolved, large companies such as Amazon and Google have developed templated ML models on their respective Cloud Services. Some examples include Google's Dialogflow (for Natural Language Processing) and Vision AI (for Image Recognition). These solutions can save a lot of time and money, though if you're working with sensitive data, such as financial transactions, you should NOT use these pre-trained models.
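To make the Classification vs. Regression distinction concrete, here is a minimal sketch using scikit-learn (one library choice among many; the toy housing data is invented for the example):

    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Regression: predict a continuous value (here, a sale price) from a Feature.
    X = [[1200], [1500], [1800]]           # toy feature: square footage
    y_price = [200_000, 250_000, 300_000]  # toy target: sale price
    reg = LinearRegression().fit(X, y_price)
    print(reg.predict([[1650]]))           # -> an estimated price

    # Classification: assign each item to one of a set number of classes.
    y_class = [0, 0, 1]  # toy target: 0 = "starter home", 1 = "family home"
    clf = LogisticRegression().fit(X, y_class)
    print(clf.predict([[1650]]))           # -> a class label (0 or 1)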
Data Collection
If you work for a business, there will likely be an existing Data Warehouse or Data Lake with plenty of data available to train your model. However, if you're working on a personal project or creating an independent ML Application, you'll need to collect data from various sources. Listed below are the most common types of data sources (a short loading sketch follows the list):
APIs
Text Files (e.g., CSV, TSV, JSON)
Web Scraping
OLTP Databases (PostgreSQL, MySQL, MS SQL Server)
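As a sketch of two of these sources, the snippet below reads a CSV with pandas and pulls JSON from an API with requests (the file name and URL are placeholders, not real endpoints):

    import pandas as pd
    import requests

    # Text file source: read a local CSV into a DataFrame.
    df = pd.read_csv("transactions.csv")  # placeholder file name

    # API source: request JSON and flatten it into a DataFrame.
    response = requests.get("https://api.example.com/v1/orders")  # placeholder URL
    response.raise_for_status()
    orders = pd.json_normalize(response.json())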
Data Annotation is the process of adding tags, or labels, to images to help Image Recognition models know what to look for when comparing images.
Data Preparation (Not Mentioned in the video)
Just because you collected data from a trustworthy source doesn't mean the data is in a valid format for model training. This step typically involves the following:
Building an ETL (Extract, Transform, Load) Pipeline to clean the data to fit your model's needs (see the sketch after this list).
Building a Feature Store to view all potential features.
Creating Unit Tests for Data Pipelines to ensure the data meets certain criteria.
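Here is a minimal pandas sketch of the "T" in ETL plus a unit-test-style check (the amount column and validity rules are invented for the example):

    import pandas as pd

    def transform(raw: pd.DataFrame) -> pd.DataFrame:
        """Toy transform step: coerce types and drop invalid rows."""
        clean = raw.copy()
        clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
        clean = clean.dropna(subset=["amount"])
        return clean[clean["amount"] >= 0]  # negative amounts are invalid here

    def test_transform_removes_invalid_rows():
        """Unit test: pipeline output must meet our criteria."""
        raw = pd.DataFrame({"amount": ["10.5", "oops", "-3"]})
        out = transform(raw)
        assert out["amount"].notna().all()
        assert (out["amount"] >= 0).all()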
Exploratory Data Analysis (Not Mentioned in the video)
Once your data meets your organization's (or your personal) standards for cleanliness, you need to do a deep dive into it. This step typically involves the following steps:
Creating Box Plots to identify and remove any outliers.
Using SQL to identify and remove any NULL values.
Creating Histograms to check whether your data follows a normal distribution (important for many linear models).
Creating Correlation Matrices to eliminate collinearity (a combined sketch follows this list).
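A compact pandas/matplotlib sketch of these four checks, assuming df is the cleaned, numeric DataFrame from the previous step:

    import matplotlib.pyplot as plt

    # Box plots: eyeball outliers in each numeric column.
    df.plot(kind="box")
    plt.show()

    # NULL check (the pandas equivalent of a SQL "WHERE col IS NULL" query).
    print(df.isna().sum())

    # Histograms: check whether each feature looks roughly normally distributed.
    df.hist(bins=30)
    plt.show()

    # Correlation matrix: flag highly correlated (collinear) feature pairs.
    print(df.corr(numeric_only=True))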
Training, Tuning, and Evaluation
Assuming you're confident in the accuracy and quality of your Training Data, you can move on to fitting your model to that data. This step typically involves the following steps:
Determining whether to use a Statistical Model or a Deep Learning Model
Hyperparameter Tuning
Evaluation on a held-out Test Dataset (a sketch of all three follows this list)
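A minimal scikit-learn sketch covering all three sub-steps, assuming X and y are the features and target produced during Data Preparation:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Hold out a Test Dataset for the final evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Model choice + Hyperparameter Tuning: grid search with cross-validation.
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=5,
    )
    grid.fit(X_train, y_train)

    # Evaluation: measure performance once, on data the model has never seen.
    print(accuracy_score(y_test, grid.best_estimator_.predict(X_test)))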
Deployment
Once you're satisfied with your model's performance, you can move on to serving your model to your end users. This step typically involves the following steps:
Identifying a Serving Framework (a minimal serving sketch follows this list)
Deployment to Cloud (typically via Docker)
Artifact Management
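As one example of a serving framework, here is a minimal FastAPI sketch; the model file name and feature schema are placeholders you would replace with your own:

    from typing import List

    import joblib
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    model = joblib.load("model.joblib")  # placeholder artifact from training

    class PredictRequest(BaseModel):
        features: List[float]  # placeholder schema; match your model's Features

    @app.post("/predict")
    def predict(req: PredictRequest):
        prediction = model.predict([req.features])
        return {"prediction": prediction.tolist()}

Packaged into a Docker image, a service like this can be deployed to any cloud container platform, and files like model.joblib are exactly the artifacts that Artifact Management keeps track of.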
Monitoring
The MDLC is called a "lifecycle" because it runs in a continuous loop. Monitoring your model's performance as it's used in production is critical to maintaining a consistent infrastructure. This step typically involves the following steps (a logging sketch follows the list):
Live Monitoring
Inference Accuracy Measurement
Notification Setup and Log Tracking
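A minimal sketch of structured prediction logging (the log format is an assumption, not a standard); joining these logs against ground-truth labels later yields inference accuracy, and alert rules over the logs provide notifications:

    import json
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("model_monitor")

    def log_prediction(features, prediction):
        """Write one structured log line per inference for later analysis."""
        logger.info(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "features": features,
            "prediction": prediction,
        }))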
Retrain
Following the concept of Continuous Integration/Continuous Deployment (CI/CD), you will want to add new features to your model and collect new training and testing data from your users to enhance your model's performance.
You should retrain your model based on production monitoring logs and changing business requirements.
Collect new data from user inputs and retrain your model with the new data.
Continuously perform hyperparameter tuning to ensure your model is performing at its best (a retraining sketch follows this list).
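A minimal retraining sketch, reusing grid and the train/test variables from the earlier training sketch and assuming pandas data structures; load_new_labeled_data, old_score, and deploy are hypothetical stand-ins for your own data loading, current production score, and deployment step:

    import pandas as pd

    # New labeled examples collected from user inputs in production.
    X_new, y_new = load_new_labeled_data()  # hypothetical helper

    X_retrain = pd.concat([X_train, X_new])
    y_retrain = pd.concat([y_train, y_new])

    grid.fit(X_retrain, y_retrain)  # re-run hyperparameter tuning on fresh data
    new_score = grid.best_estimator_.score(X_test, y_test)

    # Only promote the retrained model if it beats the current one.
    if new_score > old_score:         # old_score: current production model's score
        deploy(grid.best_estimator_)  # hypothetical deployment step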
Although the MDLC can seem like a lot to handle, existing frameworks such as MLflow, Amazon SageMaker, and Kubeflow make managing your MDLC much simpler.
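For example, MLflow can record each pass through the lifecycle in a few lines (a minimal sketch; the parameter and metric names are illustrative, and grid comes from the training sketch above):

    import mlflow
    import mlflow.sklearn

    with mlflow.start_run():
        mlflow.log_param("n_estimators", 300)     # what was trained
        mlflow.log_metric("test_accuracy", 0.91)  # how it performed
        mlflow.sklearn.log_model(grid.best_estimator_, "model")  # the artifact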